A number of predictors have been suggested to detect the most influential spreaders of information in 
online social media across various domains such as Twitter or Facebook. In particular, degree, PageRank, 
k-core and other centralities have been adopted to rank the spreading capability of users in information 
dissemination media. So far, validation of the proposed predictors has been done by simulating the 
spreading dynamics rather than following real information flow in social networks. Consequently, only 
model-dependent, contradictory results have been achieved so far for the best predictor. Here, we address 
this issue directly. We search for influential spreaders by following the real spreading dynamics in a wide 
range of networks. We find that the widely-used degree and PageRank fail in ranking users' influence. We 
find that the best spreaders are consistently located in the k-core across dissimilar social platforms such as 
Twitter, Facebook, Livejournal and scientific publishing in the American Physical Society. Furthermore, 
when the complete global network structure is unavailable, we find that the sum of the nearest neighbors' 
degree is a reliable local proxy for a user's influence. Our analysis provides practical instructions for optimal 
design of strategies for "viral" information dissemination in relevant applications. 

Information spreading is a ubiquitous process in society which describes a wide variety of phenomena, ranging
from the adoption of innovations 1 , the success of commercial promotions 2 and the rise of political movements 3
to the spread of news, opinions and brand new products in society 4,5 . In these phenomena, starting from a few
'seeds', the information diffuses contagiously from person to person and may eventually spread through the
majority of the population in a "viral" way 6 . As such, how people contact one another in real life, as portrayed by a
social network 9-11 , should be of great significance in the information spreading process. From the early days of
research on information diffusion processes, it has been accepted that some influential individuals stand out
due to their prominent ability to shape the opinion of large populations 12 . The ability to start such a "viral" spreading
process is attributed to the spreaders' unique location in the underlying social network 13-20 . Targeting these vital
people in information dissemination is helpful for designing strategies for either accelerating the speed of
propagation in the case of product promotion, or hindering the diffusion of rumors in online social networks
as well as diseases in contact networks. Therefore, identification of privileged spreaders is of great practical
importance and has attracted much attention. Indeed, several approaches to locating influential spreaders have been
developed in the context of social science, either from the algorithmic aspect 21 , or from the view of topology and
dynamical modeling 7 .

Searching for individual superspreaders of information is commonly implemented by ranking the users in
terms of topological measures. Consequently, a reliable and efficient topological predictor is indispensable for
locating capable nodes for spreading. However, so far there is no consensus on the best predictor of influence. A
number of different measures aimed at identifying influential spreaders have been suggested over the years 22 . The most
prominent ones include degree 23,24 , PageRank 25 , betweenness centrality 26 , and k-core (also called k-shell, denoted
by $k_s$) 27-32 . (a) Degree (the number of connections of a node) is the most direct and widely used topological measure
of influence. In a social network with a broad degree distribution 23 , the most connected people, or hubs, are usually
believed to be responsible for the largest diffusion processes 23,24 . (b) PageRank is a network-based diffusion
algorithm which describes a random walk process on hyperlinked networks. Although it was originally proposed
to rank content in the World Wide Web and stimulated the revolution in the web search industry that contributed to
the emergence of the search giant Google, PageRank is applied in many circumstances to rank an extensive array
of data 33 . Due to its simple assumptions, straightforward implementation and relatively low computational
complexity, researchers have been inspired to use PageRank to identify pivotal individuals in social networks in many
practical situations 34-38 . (c) In the social network context, betweenness centrality is defined as a measure of how
many shortest paths cross through a node 26 . (d) Finally, k-core
describes the location of a person in the social network by assigning
to each node a $k_s$ index obtained by iteratively pruning all the nodes
in the network with degree $k \le k_s$ 27-32 . Periphery nodes correspond to small
$k_s$, and the largest value of $k_s$ defines the network k-core. Among
these measures, PageRank, betweenness centrality and k-core are
global indices, since their calculation requires the complete network
structure, as opposed to the local degree (details in Methods).
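
The iterative pruning that defines the $k_s$ index can be illustrated with a short sketch. It is not the authors' implementation: it assumes the network is given as a list of undirected links (the treatment of link direction is left to the Methods), and the function name `k_shell_indices` is hypothetical.

```python
from collections import defaultdict


def k_shell_indices(edges):
    """Return {node: k_s} for a graph given as (u, v) pairs.

    Assumption: links are treated as undirected and self-loops are ignored.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)
    degree = {node: len(nbrs) for node, nbrs in neighbors.items()}
    remaining = set(degree)
    k_s = {}
    k = 0
    while remaining:
        # Nodes whose remaining degree is at most k belong to shell k.
        to_prune = [n for n in remaining if degree[n] <= k]
        if not to_prune:
            k += 1  # nothing left to prune at this level, move to the next shell
            continue
        while to_prune:
            node = to_prune.pop()
            if node not in remaining:
                continue  # already pruned via another neighbor
            remaining.discard(node)
            k_s[node] = k
            for nbr in neighbors[node]:
                if nbr in remaining:
                    degree[nbr] -= 1
                    if degree[nbr] <= k:
                        to_prune.append(nbr)  # pruning cascades to this neighbor
    return k_s


# A triangle (a, b, c) with one pendant node d attached to a.
shells = k_shell_indices([("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")])
```

In this toy graph the pendant node ends up in the periphery ($k_s = 1$), while the triangle forms the core ($k_s = 2$), even though node a has the largest degree.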

Unfortunately, the unavailability of the full content diffusion record in
complete networks has so far prevented straightforward validation of the
efficiency of such measures and comparison of different approaches.
This difficulty led previous works to rely on artificial stochastic models
in studies of content diffusion. In fact, a drawback of previous
studies of spreaders is that validation of the proposed predictors has
been done by modeling the spreading of information in a given
network, rather than by using the real spreading dynamics. This fact
has led to an intense debate in the literature, with a number of papers
claiming contradictory results on the best predictor of influence
according to the particular model used to simulate the spreading
process. Models include, for instance, random walks for PageRank 25 ,
susceptible-infectious-recovered (SIR) and susceptible-infectious-susceptible
(SIS) models for information and disease spreading 39 , as
well as rumor, threshold and cascading models in opinion spreading 7,21,40 .
These epidemiology-inspired models are typically based on
very simplified assumptions of human behavior that may not be
representative of the actual information spreading dynamics in a real
setting. As a consequence, they give rise to model-dependent predictors
for the best spreaders. For example, in simulations of SIR
and SIS models on real-world networks, k-core outperforms other
measures like degree and betweenness centrality 32 , whereas for the
model of rumor dynamics, k-core becomes invalid due to the absence
of influential spreaders 40 . Moreover, observational studies tracking
actual diffusion processes suggest that prediction relying on models
does not work well in practice 41-45 . In particular, these models usually
fail to account for key elements affecting information consumption,
such as user activity, individual interests and the distribution of these
properties in the network (i.e. assortative mixing). This modeling
approach has led to a large number of diverse, frequently contradictory
predictions for the performance of influential spreaders and seeding
strategies. The very assumption that the network topology can predict
the spreading performance of the individual user has never been
reliably validated. These issues motivated us to empirically test the
variety of suggested predictors of influence using real information
diffusion dynamics, in order to find practical and reliable topological
identifiers of superspreaders of information.

The lack of empirical validation so far can be attributed to
the difficulty of measuring at the same time the full network links
between users (for instance, all the followers in Twitter) and the
diffusion of information (for instance, all the tweets and retweets in
a given time window). Here, we address this issue by presenting a full
empirical investigation of superspreaders of information performed
by following the real diffusion dynamics in some of the most important
online social networks to date. The empirical novelty of our
analysis is that, in this setting, the influence exerted by the innovators,
leaders and influential individuals in the existing communities
can be precisely quantified. In this sense, a detailed experimental
study of the conditions necessary for the rise of superspreaders of
information can be performed.

Contrary to common belief, although PageRank is effective in
ranking web pages, there are many situations where it fails to locate
superspreaders of information in reality. Furthermore, we find that
the degree of the user is not a reliable predictor of influence in all
circumstances. With extensive datasets from a blog website
(LiveJournal.com), a microblogging service (Twitter.com), an online social
network (Facebook.com) and scientific publishing (the journals of the
American Physical Society, APS), we consistently find that the best
spreaders are located in the k-core. The k-core not only predicts
the average influence of users better than other predictors, but also
recognizes the top performing spreaders more accurately. However,
since k-core is a global measure, it is inconvenient to evaluate for
large-scale networks. To solve this problem, we find a simple yet
effective local proxy for users' influence, the sum of the nearest
neighbors' degrees, whose performance is comparable with that
of the global measure k-core.
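
As an illustration of the local proxy mentioned above, the following minimal sketch computes the sum of the nearest neighbors' degrees. It assumes an undirected edge list; the function name `neighbor_degree_sum` is hypothetical, and which degree (in, out or total) to use on a directed network is not specified in this section.

```python
from collections import defaultdict


def neighbor_degree_sum(edges):
    """Return {node: sum of the degrees of its nearest neighbors}.

    Assumption: edges are undirected (u, v) pairs; self-loops are ignored.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)
    degree = {node: len(nbrs) for node, nbrs in neighbors.items()}
    # Only each node's own neighborhood is needed, so no global structure is required.
    return {node: sum(degree[n] for n in nbrs) for node, nbrs in neighbors.items()}


# A path a - b - c - d: the two inner nodes see more connected neighborhoods.
proxy = neighbor_degree_sum([("a", "b"), ("b", "c"), ("c", "d")])
```

The proxy only requires each user's neighbor list and the neighbors' degrees, which is why it remains usable when the complete network is unavailable.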

Results 

Test of predictors in real information flow processes. To eliminate
the dependence of superspreader identification on the particular
model used to simulate the dynamics, we study the problem of
ranking spreaders by following the real dynamics of information
diffusion in real-world social networks. Tracking actual diffusion
processes in social systems is a rather difficult task, as it requires
the complete record of the social network structure as well as the
entire history of the diffusing content. Spreading in such systems can
be viewed in terms of two layers: the underlying social network and the
diffusion processes embedded in the population (see Fig. 1a). Considering
the large scale of modern online social networks, the privacy policies of
clients and the diversity of information diffusion patterns, the necessary
information may not be available for most social networks.
Consequently, in the absence of the record of the spreading
content, earlier research mostly modeled diffusion with, for
instance, SIR or rumor spreading models rather than studying
the real diffusion directly. The outcomes of such work can be
highly sensitive to the underlying model assumptions. Considering
the complexity of the cognitive, social and structural processes
involved in society-scale information spreading dynamics, it is
essential to empirically validate the outcomes of such research.

To this end, we have collected the full information dynamics and
topological network structure of a large dataset representing public
blog posts published at LiveJournal.com (LJ), a well-known
online community of bloggers (all datasets used in this work are
available at http://lev.ccny.cuny.edu/~hmakse/soft_data.html). Previous
research has shown that this network has characteristics consistent
with other large-scale social networks 46,47 . In LJ, each user
maintains a friend list, which represents social ties to other LJ users.
The network composed of these social links is believed to reliably
represent the actual social relations of the LJ users 6 . In terms of the LJ
social network, the presence of user i in user j's friend list represents a
directed link from j to i. Similarly to Twitter and Facebook, such links
help LJ users to track the information published by their peers. In
fact, the LJ engine generates a special page accumulating updates
from all users in one's friend list. We have crawled the friend
lists of all users, resulting in a complete social network containing
about 9.6 million users (see Table I). In addition, considering that one
of LJ's main functions is to facilitate diffusion of content, we have
collected all the available blog posts from February 14th, 2010 to
November 21st, 2011. In particular, we gathered 56,180,137 posts
published by the LJ users.

LJ users maintain the custom of referencing the original post when
they refer to another user's information. As a result, we can directly
track the information passed from one user to another. For instance,
if user i's post contains links to user j's post, we infer that information
spreads from j to i. We identified 598,833 posts that contained links
to other posts published by LJ users and defined a diffusion link from
j to i if i cited j's blog at least once. In this way, we obtain a directed
unweighted diffusion graph representing information spread in LJ
during the observation period.
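
The construction of the diffusion graph described above can be sketched as follows. The input format (one `(citing_user, cited_user)` pair per linking post) and the function name are hypothetical, and skipping self-references is an assumption that the text does not state.

```python
def build_diffusion_graph(citations):
    """Build a directed, unweighted diffusion graph.

    citations: iterable of (citing_user, cited_user) pairs, one per post
    that links to another user's post (hypothetical input format).
    Returns {j: set of users i who cited j at least once}, i.e. links j -> i.
    """
    successors = {}
    for citing, cited in citations:
        if citing == cited:
            continue  # assumption: self-references are not counted as diffusion
        successors.setdefault(cited, set()).add(citing)
        successors.setdefault(citing, set())  # keep pure receivers as nodes
    return successors


# User i cites j twice and k cites i once: links j -> i and i -> k.
graph = build_diffusion_graph([("i", "j"), ("i", "j"), ("k", "i")])
```

Repeated citations between the same pair collapse into a single link, which matches the "at least once" rule and keeps the graph unweighted.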

We should note that the LJ data is nearly ideal for testing the ranking of
spreaders. The complete network structure enables us to compute the
necessary network measures (such as PageRank and k-core) accurately.
Moreover, explicit reference to a peer's publication makes posts
attributable to specific users with measurable network properties.

Figure 1 | Schematic illustrations of the diffusion process and network structure. (a) A schematic illustration of the two-layer structure of connectivity and
diffusion. The lower layer displays the social network, while the upper layer represents the information diffusion. (b) An example of a diffusion instance
starting from source node s. The influence region of s, shaded in green, contains 5 nodes. (c) The k-shell structure of the LJ social network. The $k_s$ indices
increase as we move from the periphery to the center. The node's degree is reflected by its size. Here we highlight four hubs located in the periphery of the
network. This inset is created with the Lanet-vi tool (http://lanetvi.soic.indiana.edu/lanetvi.php). (d-f) The influence of the spreading process cannot be
predicted reliably by degree. For the LJ network, we compare the influence area of single nodes with the same degree k = 6902 (nodes A and B) or the same
index $k_s$ = 230 (nodes A and C). In the lower level of the corresponding plots, the nodes' k-shell indices are marked with different colors. In the upper level,
nodes colored green constitute the influence area, while the grey nodes are not influenced by the source node.



Without such a custom, it is difficult to distinguish between contagion-like
diffusion attributable to network users and diffusion of content
coming from external information sources, like newspapers or news
channels.

We find that the diffusion graph is quite different from the underlying
social network. First, the size of the diffusion graph is relatively
small compared to the size of the underlying social network: only
246,423 users are actively involved in the diffusion processes.
Although the remaining users belong to the social network, they
may be inactive or unwilling to disseminate the information. This
dynamics is particularly suitable for the study of spreaders,
because it highlights the roles of individual users and of
the underlying network. If the spreading content could routinely
reach a large fraction of the population, the topological location of
the source node and the network layout would no longer be important.
Second, we find that the diffusion processes do not always follow
social network links, as reported recently 48 . Contrary to the common
assumption in dynamical models like SIR or SIS that information diffuses
through social connections, there are situations
where information spreads between two users even if they are not
connected by a social link. In the case of LJ, only 31.93% of the
spreading posts can be attributed to the observable social links.
The reason for this effect is that the posts can be found via search
engines or promoted by the LJ engine even if the author and the
reader are not directly connected. This observation questions the
relevance of social network measures for studies of the content
diffusion processes occurring on top of these networks. Perhaps the
reliance on the network properties is not justified if the actual diffusion
is not confined to the underlying network. In this work we
specifically test the capacity of the individual user attributes computed
from the explicit social network to predict the user's ability to
disseminate content in the system.

In reality, a piece of information usually starts from one or a few
independent sources. Then some of the system users repost this
information, referencing the origin, so that it is passed on to their
friends. This process is observed repeatedly, resulting in system-wide
diffusion 7,8 . With this process in mind, we follow the diffusion links
starting from each node i in LJ and identify the first-layer users who
have passed user i's information to their neighbors. Then we track the
diffusion links originating from these users, and so forth, until the
entire diffusion cascade is recovered. The resulting set of nodes
represents the region of influence for node i. Although the content
of the diffusing information may mutate as it is passed between the
nodes in the region of influence, the source node i is assumed to be
responsible for the entire cascade. We therefore quantify the impact of
node i on the information spreading process as the number of
users in the region of influence and denote that quantity as $M_i$.
Figure 1b exemplifies the calculation of $M_i$. Starting from the source
node s, we track the diffusion links layer by layer in a breadth-first-search
(BFS) fashion. To eliminate the effect of loops, from one layer
to the next layer, only newly covered nodes are put in the search
queue. In this example, the search lasts for 3 layers and node s has an
influence of $M_s$ = 5. Notice that the diffusion graph represents all the
diffusion processes during the observation period, so $M_i$ is the overall
influence for all the posts of user i.

Table I | Properties of the real-world networks studied in this work.
Here $N$ is the number of nodes, $N_E$ is the number of edges, $\langle k_{in} \rangle$ is
the average in-degree of the network, and $N_d$ is the number of
nodes involved in diffusion. For undirected networks, $\langle k_{in} \rangle$
represents the average degree.

Network       Type         $N$        $N_E$        $\langle k_{in} \rangle$   $N_d$
Livejournal   directed     9573126    188240039    19.7                       246423
APS           undirected   162142     1306506      16.1                       29814
Facebook      directed     63731      1545685      24.3                       35813
Twitter       directed     2870418    4772477      1.7                        901949
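
The layer-by-layer search used to obtain $M_i$ can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the diffusion graph is a dictionary mapping each user to the users who cited them, the source itself is not counted in $M_i$, and the optional `max_layers` cap is a hypothetical convenience parameter. The toy graph is invented and is not the one drawn in Fig. 1b.

```python
from collections import deque


def influence_size(successors, source, max_layers=None):
    """Size of the region of influence of `source` found by breadth-first search.

    successors: {user: iterable of users who received information from user}.
    Assumption: the source itself is not counted in the result.
    """
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, layer = queue.popleft()
        if max_layers is not None and layer >= max_layers:
            continue  # do not expand beyond the allowed number of layers
        for nxt in successors.get(node, ()):
            if nxt not in visited:  # only newly covered nodes enter the queue
                visited.add(nxt)
                queue.append((nxt, layer + 1))
    return len(visited) - 1


# Toy cascade with three layers {a, b}, {c}, {d, e} and a loop back to s.
toy = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d", "e"], "e": ["s"]}
m_s = influence_size(toy, "s")
```

Because visited nodes are never queued again, the loop from e back to s does not inflate the count.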

Because of the rich topological structure of the LJ social network, not
all the hubs are located in the core region 32 . In Fig. 1c, we highlight
hubs located in the periphery with black squares. Figures 1d-e illustrate
that the influence $M_i$ is not necessarily determined by the degree
of the spreading origin. The influence area can be rather different even
when spreading starts from hubs of similar degree, as shown in
Fig. 1d and Fig. 1e. Instead, we find that the location of the origin
given by its $k_s$ index predicts the influence more accurately, as presented
in Fig. 1d and Fig. 1f. Figures 2a and b display the comparisons
of k-core and two other centralities: in-degree $k_{in}$ and PageRank.

In a network with N nodes, the topological structure is described
by the adjacency matrix $A = \{a_{ij}\}_{N \times N}$, where $a_{ij} = 1$ if user i is in the
friend list of user j, and $a_{ij} = 0$ otherwise. For node i, the in-degree is
defined by $k_{in}(i) = \sum_{j=1}^{N} a_{ij}$. In fact, in LJ $k_{in}(i)$ is the number of user
i's followers who have direct access to i's posts. PageRank 25 mimics a
random walker in hyperlinked networks, and quantifies nodes' relative
influence by considering the importance of their neighbors
recursively (see details in Methods). Here we do not consider the
betweenness centrality because it is infeasible to calculate for large-scale
social networks. Currently, the most efficient solution is
Brandes' algorithm 49 , which has complexity O(nm) for a network with n nodes and m links. In our case,
for the LJ social network with nearly 10 million nodes and 200 million
links, it is impossible to obtain the betweenness centrality in a reasonable
time. Besides, previous research on SIR and SIS modeling 32
suggests that betweenness centrality does not work well in identifying
the best spreaders.
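
For readers who want to see the recursion behind PageRank, the following is a generic power-iteration sketch. It is not the implementation used in the paper (whose details are in Methods): the damping factor 0.85, the uniform treatment of dangling nodes and the function name are assumptions made for illustration. Here `out_links[j]` lists the users in j's friend list, i.e. the links from j to i.

```python
def pagerank(out_links, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a directed graph {node: list of targets}.

    Assumptions: damping factor 0.85; the rank of nodes without outgoing
    links is redistributed uniformly over all nodes.
    """
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    nodes = list(nodes)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for v, targets in out_links.items():
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share  # v passes rank along its outgoing links
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Users b, c and d all list a as a friend; a lists only b.
ranks = pagerank({"b": ["a"], "c": ["a"], "d": ["a"], "a": ["b"]})
```

Each iteration touches every link once, which is why PageRank stays tractable on networks where betweenness centrality does not.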

For the users involved in information diffusion, we calculate the
average influence $M(k_s, k_{in})$ for nodes with a given combination of $k_s$
and $k_{in}$:

$$M(k_s, k_{in}) = \frac{\sum_{i \in \Upsilon(k_s, k_{in})} M_i}{N(k_s, k_{in})} \tag{1}$$

Here $\Upsilon(k_s, k_{in})$ is the collection of all the users participating in diffusion
in the $(k_s, k_{in})$ bin, and $N(k_s, k_{in})$ is the number of these users.
We then take the base-10 logarithm of these averages and display them in
Fig. 2a. To eliminate extreme cases, we only display results for data
bins with $N(k_s, k_{in}) > 20$. We observe that for nodes with fixed in-degree,
the influence can be either large or small. Meanwhile, the
nodes located in the same $k_s$ shell have similar influence. In order to
have a direct view of this observation, we compare the variation of
influence for nodes within fixed measure intervals. We divide the
range of each measure into 5 equal bins on a logarithmic scale,
and then calculate the standard deviation of influence s(M)
for nodes within each bin. Concretely, taking the in-degree as an
example, the interval 0%-20% in Fig. 3a means the in-degree range
$k_{min} \le k_{in} \le k_{min} + \exp[20\% \cdot \log(k_{max} - k_{min})]$, where $k_{min}$ and $k_{max}$
stand for the minimal and maximal in-degree respectively.
Therefore, the intervals in Fig. 3a correspond to stripes of the same width
on the y-axis of Fig. 2a. In Fig. 3a, we find that the standard deviation
of influence is smaller for $k_s$, which is in accordance with our observation
in Fig. 2a. However, the standard deviation is relatively high
compared with the average influence. This means that, in reality, the
diffusion processes are quite random. Similar results on the efficiency
of high-$k_s$ nodes are obtained from the analysis of $M(k_s, \mathrm{PageRank})$
in Fig. 2b.
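
The two statistics above can be sketched in a few lines. This is an illustration under assumptions: users are grouped by their exact $(k_s, k_{in})$ combination (the bin widths used for Fig. 2 are not given here), the population standard deviation is used, the natural logarithm is assumed in the interval formula, and all names are hypothetical.

```python
import math
from collections import defaultdict
from statistics import pstdev


def average_influence(records, min_count=20):
    """Eq. (1): log10 of the mean influence per (k_s, k_in) combination.

    records: iterable of (k_s, k_in, M_i) tuples, one per user in the diffusion graph.
    Only combinations with more than min_count users are kept.
    """
    groups = defaultdict(list)
    for k_s, k_in, influence in records:
        groups[(k_s, k_in)].append(influence)
    result = {}
    for key, values in groups.items():
        mean = sum(values) / len(values)
        # Skip sparse bins, and bins with zero mean (log10 undefined).
        if len(values) > min_count and mean > 0:
            result[key] = math.log10(mean)
    return result


def influence_std_by_interval(measure, influence, n_bins=5):
    """Standard deviation of influence inside log-spaced intervals of a measure.

    Interval b ends at k_min + exp[(b + 1) / n_bins * log(k_max - k_min)].
    Returns one value per interval (None when an interval is empty).
    """
    k_min, k_max = min(measure), max(measure)
    if k_max <= k_min:
        raise ValueError("measure must take at least two distinct values")
    log_range = math.log(k_max - k_min)
    upper = [k_min + math.exp((b + 1) / n_bins * log_range) for b in range(n_bins)]
    bins = [[] for _ in range(n_bins)]
    for value, m in zip(measure, influence):
        # First interval whose upper edge covers the value; the last interval
        # also absorbs k_max in case of floating-point round-off.
        index = next((b for b, edge in enumerate(upper) if value <= edge), n_bins - 1)
        bins[index].append(m)
    return [pstdev(b) if b else None for b in bins]


records = [(3, 10, 100)] * 21 + [(1, 10, 1)] * 5
averages = average_influence(records)
spread = influence_std_by_interval([1, 1, 1001, 1001], [2, 4, 10, 10])
```

In the toy data, the sparse bin $(1, 10)$ is dropped by the count threshold, and the two low-degree users fall into the first interval while the two high-degree users fall into the last.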

Figure 2a shows that there are hubs in the periphery (small k-core
values) with small influence. However, how many such hubs exist is
not quantified in this figure. Taking the number of such hubs into
consideration, we compare the average influence M(f) of the top f-fraction
nodes for each measure in Fig. 3b. If there is a large number
of hubs with small k-core values, the average influence M(f) for in-degree
will be smaller than that of k-core. For the nodes involved in
spreading, we rank the users according to different measures, select
the nodes ranked in the top f-fraction, and calculate their average
influence. In the case of in-degree, for instance, we rank the in-degree
in decreasing order, $k_{in}(i_1) \ge \dots \ge k_{in}(i_{N_d})$ ($N_d$ is the number of nodes participating
in diffusion). Then the top f-fraction nodes of in-degree are the users
$i_1, i_2, \dots, i_{\lfloor f N_d \rfloor}$. As the fraction f increases, more nodes
with smaller influence are selected, so the average influence M(f)
decreases as f grows. The error bar is the 95% confidence interval
obtained by bootstrap 50 (see details in Methods). On average, the
nodes with higher $k_s$ can trigger larger diffusion than those with
higher in-degree and PageRank. To better interpret this, in Fig. 3c
we display the ratio between the average influence M(f) of $k_s$ and M(f) of
the other two measures respectively. For both in-degree and
PageRank, this ratio stays above 1 for almost all fractions f.
This means that, in most cases, the nodes with high $k_s$ have larger
influence than nodes with high $k_{in}$ and PageRank.
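
The top-fraction average M(f) and its error bars can be sketched as follows. The bootstrap details are given in Methods and are not reproduced here, so the percentile bootstrap with 1,000 resamples below is an assumption, as are the function names and the rule of always keeping at least one node.

```python
import math
import random


def top_fraction_mean(scores, influence, f):
    """Average influence M(f) of the users ranked in the top f-fraction by `scores`."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_top = max(1, math.floor(f * len(ranked)))  # assumption: keep at least one node
    return sum(influence[u] for u in ranked[:n_top]) / n_top


def bootstrap_ci(values, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n for _ in range(n_resamples)
    )
    lower = means[int((1 - level) / 2 * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 + level) / 2 * n_resamples))]
    return lower, upper


k_shell = {"a": 3, "b": 2, "c": 1, "d": 0}
influence = {"a": 10, "b": 6, "c": 1, "d": 0}
m_half = top_fraction_mean(k_shell, influence, 0.5)
lower, upper = bootstrap_ci([1, 2, 3, 4, 5])
```

Dividing `top_fraction_mean` for one predictor by the same quantity for another gives the ratio plotted in Fig. 3c.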

Although $k_s$ predicts the average influence well, the
influence of single nodes has large fluctuations, so whether $k_s$ can
better locate individual superspreaders is still not clear. Therefore,
we check the performance of each measure in recognizing influential
spreaders directly. We define the recognition rate r(f) as:

$$r(f) = \frac{|I_f \cap P_f|}{|I_f|} \tag{2}$$

where $I_f$ and $P_f$ are the sets of nodes ranking in the top f-fraction by
influence and predictor respectively, and $|I_f|$ is the number of nodes
in $I_f$. Taking $I_f$ as an example, we rank the nodes' influence in descending
order, $M_{i_1} \ge \dots \ge M_{i_{N_d}}$. Then $I_f$ is the set of nodes with labels
$i_1, i_2, \dots, i_{\lfloor f N_d \rfloor}$. Similarly, we define the set $P_f$ for k-core, in-degree
and PageRank. Figure 4a shows that the recognition rate for $k_s$ is
larger than that for in-degree and PageRank. This direct evidence supports
the conclusion that $k_s$ can indeed find more superspreaders than $k_{in}$ and PageRank.
Therefore, k-core is more practical in predicting influential nodes.
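
The recognition rate of Eq. (2) is the overlap between two top-f sets. A minimal sketch, with hypothetical function names and arbitrary tie-breaking (the text does not say how ties are resolved):

```python
import math


def top_set(scores, f):
    """Set of users ranked in the top f-fraction by `scores` (ties broken arbitrarily)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[: math.floor(f * len(ranked))])


def recognition_rate(influence, predictor, f):
    """Eq. (2): r(f) = |I_f ∩ P_f| / |I_f|."""
    i_f = top_set(influence, f)
    p_f = top_set(predictor, f)
    if not i_f:
        raise ValueError("f is too small: the top-f set is empty")
    return len(i_f & p_f) / len(i_f)


influence = {"a": 9, "b": 8, "c": 1, "d": 0}
predictor = {"a": 5, "c": 4, "b": 3, "d": 1}
rate = recognition_rate(influence, predictor, 0.5)
```

Here the top half by influence is {a, b} and the top half by the predictor is {a, c}, so one of the two true top spreaders is recognized.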

The failure of degree and PageRank can be explained as follows.
The degree only considers the number of nearest neighbors of a user.
If a hub is located in the periphery of a network, it may have an
insignificant impact on the spreading process 32 , since its neighbors
are limited in spreading capability. As for PageRank, it is frequently
used to identify efficient spreaders based on the assumption that
content spreads randomly in the network. However, in reality the
information diffusion paths are not random walks 51 : information
spreading is not totally random, in the sense that certain peers are
more likely to be chosen than others. This may introduce
a significant discrepancy between the PageRank predictions and
the actual outcomes. The k-core approach, on the other hand, is
simply aimed at maximizing access, that is, the number of easily reachable
nodes. The empirical evidence supports the conclusion that this straightforward
approach is more effective than the ones based on specific assumptions
about the dynamical processes, such as PageRank.

Apart from the extensive data of LJ, we also explore the dissemination
of scientific information in the publications of the APS journals.
In the context of social networks, research not only addresses
the problem of blog diffusion, but also concerns the dissemination
of innovations, such as scientific ideas published in research papers 1 .
Thus, with the APS database, we intend to analyze another type of
spreading, i.e. the dissemination of scientific ideas, to see if we can obtain
general conclusions on locating superspreaders across different types
of spreading dynamics.

The dataset of APS journals (Physical Review A, B, C, D, E and
Physical Review Letters) includes the information on authors and
citations for all the publications until 2005, comprising 247,676 scientific
papers. The social network is formed by co-authorship, i.e., if
author i writes a paper with author j, an undirected social link is

Figure 2 | The k-shell index predicts the average influence of spreading more reliably than in-degree and PageRank. Logarithmic values of the average size
of the influence region $M(k_s, k_{in})$ when spreading originates in nodes with $(k_s, k_{in})$ for LJ (a), APS (c), Facebook (e) and Twitter (g) are shown. The same
analysis with PageRank is also presented in (b), (d), (f), (h). In general, spreading is larger for nodes of higher $k_s$, whereas nodes of a given $k_{in}$ or
PageRank can result in either small or large spreading.

Figure 3 | Nodes with high k-shell have larger average influence than those with high in-degree and PageRank. (a) The standard deviation of influence
s(M) for nodes within each interval for LJ. The data intervals are created by dividing the range of measures equally according to the logarithmic
values. (b) The average influence M(f) for nodes ranking in the top f-fraction by k-shell $k_s$, in-degree $k_{in}$ and PageRank p for LJ data. (c) The ratio between the
average influence of nodes within the top f-fraction of $k_s$ and that of the other two measures. The red line marks the value of 1. The error bars in (b) and (c)
present the 95% confidence intervals obtained by bootstrap analysis.

Figure 4 | k-shell can recognize influential spreaders more accurately than in-degree and PageRank. The recognition rate r(f) for LJ (a), APS (b),
Facebook (c) and Twitter (d) with k-shell $k_s$, in-degree $k_{in}$ and PageRank p. For all the datasets, $k_s$ performs better than in-degree and PageRank. The error
bars mark the 95% confidence intervals by bootstrap.

established between them. To exclude very large cliques in the social
network, we leave out papers with more than 10 authors, which
account for only 1.95% of all the publications. The diffusion of
information is reflected by citations. When a scientific idea is proposed
in a paper, the scientists who are interested in this idea will cite
this paper as a reference in their own papers. In this way, we can track
the diffusion of scientific ideas. To extract these spreading instances,
we establish a directed diffusion link from author j to author i if i has
cited j's papers more than s times. Here, we define a cutoff because
it is desirable to capture authors' steady focus on other people's work,
rather than temporary citations with small relevance. In what follows,
we set s = 10, and the choice of s will not affect the results.
Similarly to the LJ data, this dataset also contains the complete social
network and full records of spreading instances. The impact of individuals
in the spreading process is again described by the size of the region
of influence. Since the diffusion graph is extremely dense, we
limit the BFS search to 5 layers.

The comparisons of k-core versus degree and PageRank are presented
in Fig. 2c, d and Fig. 4b. Although the spreading mechanism of
scientific ideas is different from that of posts in Livejournal, the
results of these two cases are quite consistent: k-core outperforms
degree and PageRank. This interesting finding indicates that k-core
captures some generic properties of the diffusion process as a reliable
predictor for influential spreaders across different spreading processes.
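
The citation cutoff used for the APS diffusion links can be sketched as follows. The input format (one `(citing_author, cited_author)` pair per citation) and the function name are hypothetical, and skipping self-citations is an assumption that the text does not state; the cutoff `s` is passed explicitly rather than fixed.

```python
from collections import Counter


def thresholded_diffusion_links(author_citations, s):
    """Directed links j -> i for authors i who cited j's papers more than s times.

    author_citations: iterable of (citing_author, cited_author) pairs, one per
    citation (hypothetical input format).
    Assumption: self-citations are ignored.
    """
    counts = Counter(
        (cited, citing) for citing, cited in author_citations if citing != cited
    )
    return {edge for edge, n in counts.items() if n > s}


# Author i cites j three times, k cites j twice; with s = 2 only j -> i survives.
citations = [("i", "j")] * 3 + [("k", "j")] * 2
links = thresholded_diffusion_links(citations, s=2)
```

The strict inequality mirrors the wording "more than s times"; a depth-limited breadth-first search over the resulting links then gives the region of influence.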

Robustness of k-core for sampled networks. While we are able to 
obtain the data of the complete social networks of LJ and APS, such 
comprehensive datasets are usually not available for the majority of 
the online social networks, including popular platforms like 
Facebook and Twitter. Therefore, it would be desirable to identify 
spreaders even for networks where we do not have the complete 
network structure. In order to check the performance of k-core in 
networks with partial links, we analyze the subnetworks sampled 
from Facebook and Twitter. 

The Facebook data contains the friend lists and the entire record of
wall posts over a period of two years from a regional network of
Facebook corresponding to the city of New Orleans, LA in the
USA 52 . This network covers 60,290 users and 838,092 wall posts
spanning from September 26th, 2006 to January 22nd, 2009. As in the
case of the LJ data, the social network can be constructed from the friend
relations. The diffusion instances can be inferred as follows: if user i
posts comments on user j's page, we infer that i obtained information
from j that motivated him/her to write the comments. We do not infer
information flows in the opposite direction because there are many
circumstances in which, although i posts comments on j's page, j may not
read these comments. This could happen, for instance, when j is a
celebrity and i is a fan. While this dataset covers only a geographical
community in Facebook, it has the advantage of containing the complete
history of diffusion interactions. Following the previous examples,
we use the size of the region of influence as a criterion of
significance in spreading.

The crawl of the Facebook New Orleans network was done by a 
snowball sampling method 52 . The sampling experiment on the LJ 
social network shows that such a sampling method does not destroy 
the relative ranking of nodes for k_s, k_in and PageRank (see our 
detailed study in Methods). The results for the partial Facebook network 
are presented in Fig. 2e, f and Fig. 4c. Consistent with the results of 
the two complete datasets, LJ and APS, k-core outperforms in-degree 
and PageRank even though the Facebook network is incomplete. 

Another important example of a large-scale microblogging network 
is Twitter. Twitter is an online social networking and microblogging 
service that has gained worldwide popularity. Here we use the dataset 
of approximately 16 million tweets sampled between January 23rd 
and February 8th, 2011 and publicly shared by Twitter 
(http://trec.nist.gov/data/tweets/) 53 . The natural way to get the social 
network is to extract the follower network through the Twitter API. 
Unfortunately, due to the access rate limit of the Twitter API, it is 
impossible to obtain the full information of the follower network in a 
reasonable time. To approximate the social network, we use an 
alternative, the mention network, which has been studied in many 
previous works on Twitter 48,54,55 . In contrast to normal tweets, mentions 
(tweets containing @username) usually include personal conversations or 
references. In fact, mention links represent stronger ties than 
follower links, as has been shown before 48 . Therefore, the mention 
network can be viewed as a stronger version of interactions 
between Twitter users. In the mention network, if user i mentions 
user j in his/her tweets, there exists a directed link from i to j. 

In order to obtain the diffusion graph, we extract retweet relations 
from the tweets. A retweet (RT @username) corresponds to forwarded 
content with the specified user as the nominal source. If user i 
retweets a tweet of user j, then the information propagates from j 
to i, thus establishing a diffusion link from j to i. In this way, the social 
network and diffusion graph of Twitter are constructed. Since the 
tweets are sampled from all published tweets during the observation 
period, we still need to check the impact of the sampling method. We 
perform such a sampling experiment with the LJ data, in which we find 
that the relative rankings of k_s, k_in and PageRank are not dramatically 
affected by sampling (details in Methods). 
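
The construction of the mention network and the retweet diffusion graph can be illustrated with a minimal Python sketch. It is not the code used in this study: the function name `build_graphs`, the `(author, text)` input format, the regular expressions, and the lower-casing of user names are all assumptions made for illustration. The sketch also assumes that an `RT @username` marker counts only as a retweet and not as a mention, which the text above does not specify.

```python
import re

# Hypothetical patterns: "RT @user" marks a retweet, any other "@user" a mention.
RETWEET = re.compile(r"\bRT @(\w+)")
MENTION = re.compile(r"@(\w+)")

def build_graphs(tweets):
    """Build mention and diffusion links from (author, text) pairs.

    Returns two sets of directed links:
      mention_links   -- (i, j) means user i mentions user j
      diffusion_links -- (j, i) means information flows from j to i
                         because i retweeted a tweet of j
    """
    mention_links = set()
    diffusion_links = set()
    for author, text in tweets:
        author = author.lower()
        # Retweets: the retweeted user j is the source, the author i the receiver.
        for source in RETWEET.findall(text):
            source = source.lower()
            if source != author:
                diffusion_links.add((source, author))
        # Assumption: strip retweet markers before looking for mentions.
        rest = RETWEET.sub("", text)
        for target in MENTION.findall(rest):
            target = target.lower()
            if target != author:
                mention_links.add((author, target))
    return mention_links, diffusion_links

tweets = [
    ("alice", "hello @bob"),
    ("carol", "RT @alice: hello @bob"),
    ("bob", "no mentions here"),
]
mentions, diffusion = build_graphs(tweets)
```

In this toy input, the retweet by carol yields one diffusion link from alice to carol, while the two `@bob` references yield mention links from alice and from carol to bob.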

We should note here that this Twitter dataset has some drawbacks. 
First, given that the activity of users, which is measured by the number of 
posts, is power-law distributed 56 , we are biased to observe more active 
nodes, while the less active nodes are missed. Second, even though 
the mention network can represent strong social relations, it is 
sparser than the follower network, which is typically used in 
studies of diffusion on Twitter 54,57,58 . Therefore, the mention network 
misses a large fraction of follower links. However, considering that 
the Twitter social graph is not available in practice, our results are in 
fact more relevant for practical purposes. Regardless of these 
drawbacks, it is still meaningful to identify the best spreaders using the 
mention network, as long as the predictors obtained from 
the topology provide consistent predictions. Indeed, we find that this 
is the case in the studied Twitter network. From Fig. 2g, h and Fig. 4d, we 
conclude that, for the tweets sampled from Twitter, k-core is more 
effective in locating capable spreaders than in-degree and PageRank. 

The measurements in these diverse datasets present empirical 
evidence that the k_s index is a reliable predictor for influential spreaders. 
Even though the spreading dynamics differ between the examined 
systems, the results are quite uniform, suggesting that the efficiency of 
k-core could be generic. Furthermore, k-core outperforms other 
measures even in sampled networks with partial information. 

A local proxy for influence. In real-world scenarios, 
evaluation of k_s is frequently infeasible. Being a global measure, its 
computation requires collection and analysis of the complete social 
network. This could be a very challenging task in large online social 
networks such as Twitter. It would therefore be convenient to 
substitute k_s with a local proxy capable of identifying the best spreaders 
efficiently when we lack global information. 

We have already seen that the most obvious candidate, the node's 
degree, is not enough on its own for identifying spreaders because the 
nearest neighbors of a well connected person may have low degree 
and be inefficient spreaders. Considering this effect, it is reasonable to 
assume that the more efficient spreaders are the ones who not only 
have high degree but also have well connected neighbors. This 
reasoning can be further generalized to include second-nearest 
neighbors and so on. Indeed, the nodes located in the k-core of the 
network have well connected nearest neighbors, well connected 
next-nearest neighbors and so forth. In contrast, hubs surrounded by 
low-degree peers are pruned early in the k-core computation because 
the majority of their low-degree peers belong to the first shells. The 
users belonging to a high k-shell typically have high-degree neighbors 
in all layers: not only are their nearest neighbors well connected, but 
the nodes several steps away also have large degree. 

Naively, we may think that the PageRank algorithm addresses this 
issue as well, since it takes the neighbors' importance into account 
recursively, assuming that the information is disseminated by a random 
walk process. However, the underlying dynamical model 
of a predictor can heavily affect its performance, and in 
reality the information does not spread in a random walk fashion 51 . 
As a result, PageRank does not perform as well as the k-core-based 
methods. In addition, PageRank is computed globally and iteratively, and 
therefore requires the complete network structure to operate while 
suffering from performance issues on large networks (k_s is also global, 
but its running time scales linearly with system size). 

Considering these challenges faced in the implementation of global 
algorithms, we examine a simple local measure: the sum of the degrees 
of the nearest neighbors, $k_{sum}(i) = \sum_{j \in V(i)} k_j$, where $V(i)$ is the set of 
the nearest neighbors of the node i. In directed networks, V(i) is the 
set of node i's followers and the degree is the in-degree. By definition, 
k_sum is determined by both the degree of the focal node i and the 
mean degree of its followers. It is much easier to obtain the data to 
compute k_sum(i) than to compute k-core, since k_sum(i) requires only 
the degree of node i's nearest neighbors. The necessary data can be 
obtained with a 1-step snowball sample. We further compare the 
performance of k_sum to that of the sum of the degrees of the next-nearest 
neighbors, $k_{2sum}(i) = \sum_{j \in V_2(i)} k_j$ ($V_2(i)$ is the set of neighbors of 
node i's neighbors). 
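
As an illustration of the two local measures, the following Python sketch computes k_sum and k_2sum from a dictionary that maps each node to the set of its followers, so that a node's in-degree is the size of that set. The data structure and the function names are assumptions for illustration only. The sketch also assumes that V_2(i) is the union of the follower sets of i's followers with i itself excluded, a detail the definition above leaves open.

```python
def in_degree(followers, j):
    """In-degree of node j: the number of its followers."""
    return len(followers.get(j, ()))

def k_sum(followers, i):
    """Sum of the in-degrees of node i's nearest neighbors V(i)."""
    return sum(in_degree(followers, j) for j in followers.get(i, ()))

def k_2sum(followers, i):
    """Sum of the in-degrees over V_2(i), the neighbors of i's neighbors.

    Assumption: V_2(i) is a set (each node counted once) and excludes i.
    """
    second = set()
    for j in followers.get(i, ()):
        second |= set(followers.get(j, ()))
    second.discard(i)
    return sum(in_degree(followers, j) for j in second)

# Toy follower lists: followers["a"] is the set of users who follow "a".
followers = {
    "a": {"b", "c"},
    "b": {"c", "d"},
    "c": {"d"},
    "d": set(),
}
print(k_sum(followers, "a"))   # in-degree of b (2) + in-degree of c (1) = 3
print(k_2sum(followers, "a"))  # V_2(a) = {c, d}: 1 + 0 = 1
```

Only the follower lists of i and of its followers are read when computing k_sum, which is why a 1-step snowball sample suffices for this measure.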

Using the diffusion data for LJ we show that k_sum outperforms 
in-degree and PageRank, see Fig. 5a and b. We compare the performance 
of k_s with k_sum and k_2sum in Fig. 6a and b. Surprisingly, although 
k_sum and k_2sum rely on partial data, they work quite well and can be 
used to identify the best spreaders. The average influence and 
recognition rate for k_sum and k_2sum are similar to those of k_s. The reason for 
this may be that the vast majority of cascades in reality are small, as 
recently reported in ref. 51 . Therefore, the local information contained in 
nearest neighbors or next-nearest neighbors may be sufficient to 
accurately reflect influence. In fact, k_2sum can improve the performance 
of k_sum slightly, but since the number of next-nearest neighbors 
is far larger than that of nearest neighbors, it is still convenient 
and sufficient to select k_sum in practice. Similar results are also 
obtained from the APS, Facebook and Twitter datasets (see Fig. 5 
and Fig. 6). 

Discussion 

Identification of the best spreaders in the population is essential for 
the design of effective information dissemination strategies 44 in many 
domains including innovation, marketing campaigns, business 
management and public health practices. Due to the lack of data and 
severe privacy restrictions that limit access to the behavioral data 
required to directly infer the performance of each user, it is important 
to develop and validate social network topological measures capable 
of identifying superspreaders. Such measures would be extremely useful 
proxies for many practical scenarios. 

To address these issues we utilize a dataset representing diffusion 
of content within a complete online social network and confirm the 
relationship between the network topology and the information flow 
in the network. Moreover, we also directly validate a number of 
ranking mechanisms. To our surprise, we find that even though 
PageRank is frequently used in ranking network-based quantities 
in various domains, it performs worst among the examined measures 
in ranking users' influence. For all the investigated datasets, k-core is a 
reliable and robust marker for privileged spreaders, outperforming 
the ranking schemes based on degree and PageRank. k-core not 
only predicts the average influence of nodes better, but also recognizes 
the top performing spreaders more accurately. Our datasets capture 
the diffusion dynamics across the blogosphere, microblogging, online 
social networks and scientific dissemination communities. 
Furthermore, given the scale and the incompleteness of typical 
datasets, we modify k-core to rely on local network information, k_sum 
and k_2sum. We confirm that such k_s-inspired measures outperform 
in-degree and PageRank in such sampled datasets. Although the 
developed index k_sum operates locally and uses partial information, 
its performance nearly matches that of the global predictor k_s. We 
conclude that in practice the local information k_sum can be used to 
search for influential spreaders. 

Methods 

Calculation of studied measures 

• PageRank. PageRank was originally introduced to rank web pages in the World 
Wide Web (WWW). It describes a random walk process on hyperlinked networks 
and it is one example in the large class of eigenvector centralities. Each 
node is assigned a value according to its relative importance. A parameter d is 
introduced as the probability for a random walker to continue browsing through 
hyperlinks, with probability 1 − d for the random walker to jump to a random web 
page. In a network of N nodes with adjacency matrix {a_ij}, the PageRank value p_t(i) 
for node i at time step t is given by the following equation: 

$$p_t(i) = \frac{1-d}{N} + d \sum_{j} \frac{a_{ji}\, p_{t-1}(j)}{k_{out}(j)},$$

where k_out(j) is the out-degree of node j. When calculating p_t(i), p_0(i) is set to be 1 
uniformly for each node i, and the probability d is conventionally fixed as 0.85 during 
the iterations. PageRank is a global centrality that requires the complete structure 
of the network. 

• k-shell decomposition. In k-shell decomposition, we first remove all nodes with 
degree k = 1 (k = k_in + k_out for directed networks). After that, there may appear 
some nodes with one link, so we continue pruning the system iteratively until 
there is no node with k = 1. These removed nodes fall into a k-shell with index k_s 
= 1. In a similar way, we iteratively remove the next k-shell, k_s = 2, and 
continue removing higher-k shells until all nodes are pruned. The k-shell with the 
largest index k_s corresponds to the k-core. As a result, each node is assigned a 
unique k_s index, and the network can be viewed as the union of all k-shells. The 
resulting classification of a node can be very different from the classification when 
the degree k is used. Only in random uncorrelated networks, such as Erdős–Rényi, 
configurational model and BA scale-free networks, is there a high correlation 
between degree and k_s, where these two quantities are found to be proportional to 
each other 32 . In real-world networks, which are modular and correlated, the 
degree of the node does not determine its location in the k-shell structure. 
The only relation between these two quantities is that k ≥ k_s for the same node, 
by definition. 
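
The pruning procedure for the k-shell decomposition can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the code used in this study: the network is given as a symmetric dictionary `adj` mapping each node to the set of its neighbors (an undirected simple graph), and the name `k_shell` is hypothetical. For a directed network one would first have to decide how to count k = k_in + k_out, which this sketch does not address.

```python
def k_shell(adj):
    """Assign a k-shell index k_s to every node by iterative pruning.

    adj: dict mapping node -> set of neighbors (assumed symmetric).
    Returns a dict mapping node -> k_s.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        # Nodes whose current degree is at most k belong to shell k.
        queue = [v for v in remaining if degree[v] <= k]
        if not queue:
            # Nothing left to prune at this level: move to the next shell.
            k = min(degree[v] for v in remaining)
            continue
        while queue:
            v = queue.pop()
            if v not in remaining:
                continue  # already pruned via another neighbor
            remaining.discard(v)
            shell[v] = k
            # Removing v lowers the degree of its surviving neighbors,
            # which may push them into the current shell as well.
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1
                    if degree[u] <= k:
                        queue.append(u)
    return shell

# A triangle (a, b, c) with a two-node tail (d, e) attached to a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a", "e"},
    "e": {"d"},
}
shells = k_shell(adj)
```

In this toy network node a has the largest degree (k = 3), yet it shares k_s = 2 with b and c, while the tail nodes d and e fall into the k_s = 1 shell. The level-finding step above rescans the remaining nodes and is kept simple for readability; a bucket-based implementation is needed to reach running time linear in the system size.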



Impact of network sampling on measures. In the crawling of the Facebook regional 
network, a breadth-first-search algorithm is used 52 . The search starts from a single 
user and visits all the friends of this user and their friends recursively until no new 
visible users in the New Orleans area are found. This type of sampling method, known as 
snowball sampling, is widely used in social network crawls. Since the measures of 
spreaders can be affected by the network sampling, we need to check if the sampling 
method can seriously change the relative ranking in the original network. Here we 
explore the effect of snowball sampling on k-core, in-degree and PageRank by 
conducting a sampling experiment on the complete LJ social network. The experiment 
starts by randomly selecting a source node in the LJ social network. Then we crawl the 
neighbors of this source node and the neighbors of its neighbors layer by layer, until the 
desired number of sampled nodes is reached. In the experiment, we set the sampling 
fraction to 1% and 5% respectively. Once we have crawled enough users, we stop the 
sampling. Figures 7a, b and c show that the relative ranking remains almost the same 
after sampling. 
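
A snowball sampling experiment of this kind can be sketched as a breadth-first search that stops once the desired number of nodes has been collected. The Python code below is a minimal illustration, not the crawler used for the data: the function name `snowball_sample`, the dictionary-of-sets graph format and the `seed` parameter are assumptions. If the source node lies in a component smaller than the target size, the sketch simply returns that component.

```python
import random
from collections import deque

def snowball_sample(adj, fraction, seed=None):
    """Breadth-first (snowball) sample of a fraction of the nodes.

    adj: dict mapping node -> set of neighbors.
    Returns the subnetwork induced by the sampled nodes, in the same format.
    """
    rng = random.Random(seed)
    target = max(1, round(fraction * len(adj)))
    source = rng.choice(list(adj))  # random starting user
    visited = {source}
    queue = deque([source])
    while queue and len(visited) < target:
        v = queue.popleft()
        # Visit the neighbors layer by layer until enough nodes are crawled.
        for u in adj[v]:
            if u not in visited:
                visited.add(u)
                queue.append(u)
                if len(visited) >= target:
                    break
    # Keep only the links between sampled nodes.
    return {v: adj[v] & visited for v in visited}

# Toy example: a ring of 100 nodes, sampled at a 5% fraction.
ring = {i: {(i - 1) % 100, (i + 1) % 100} for i in range(100)}
sample = snowball_sample(ring, 0.05, seed=1)
```

The measures of interest would then be recomputed on the returned subnetwork and compared, node by node, with their values in the complete network.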

The sampling of the Twitter network is implemented by first selecting a fraction of 
tweets randomly and then finding the links between the users who created the 
selected tweets. To check the effect of such "activity sampling" on k_s, k_in and 
PageRank, we perform a sampling experiment with the LJ data. We randomly select 0.5% 
and 1% of the posts that have been published in LJ and keep the social links between their 
authors. Figures 7d, e and f show that the relative rankings of k_s, k_in and PageRank are 
not destroyed. 

The bootstrap analysis. In Figs. 3, 4 and 5 we display the bootstrap-estimated 
confidence intervals containing the studied quantities with 95% probability. 
Traditionally, assessment of confidence intervals is based on an assumed probability 
model for the available data. However, this approach depends on a set of assumptions 
and often leads to inaccurate approximations. The bootstrap, as an important tool in 
modern statistical analysis, overcomes the above drawbacks by repeatedly estimating 
the desired quantity in multiple random samples of the available data. Although the 
bootstrap analysis does not provide very good approximations for extremely small 
sample sizes, it performs very well in moderate and large data sets. In larger samples, 
bootstrap-estimated confidence intervals can be more accurate than confidence 
intervals based on standard asymptotic approximations. The scale of our data set, 
containing hundreds of thousands of observations, permits reliable and robust 
confidence interval estimates using bootstrap analysis. 

The standard procedure proceeds as follows: we generate multiple sets of random 
samples $X_1^*, \dots, X_n^*$ by drawing observations independently and with replacement 
from the available sample $X_1, \dots, X_n$, then calculate the quantity in question $Q^*$ in 
each of the bootstrap samples. By generating and processing m random samples 
$X_1^*, \dots, X_n^*$ we obtain a set of m estimates $Q_1^*, \dots, Q_m^*$ and use their distribution to 
assess the likelihood that the actual Q has each particular value. 

Concretely, take the calculation of the confidence intervals of M(f) in Fig. 3b as an 
example. Say we want to obtain the $1 - \alpha$ confidence interval $[\xi_{\alpha/2}, \xi_{1-\alpha/2}]$ of M(f) for 
k-shell. Initially we have n sample data $X_1, \dots, X_n$, where each datum $X_i$ contains the 
information of node i: $X_i = \{k_s(i), M_i\}$. We proceed as follows: 

Step 1. From the available sample $X_1, \dots, X_n$ we draw random samples $X_1^*, \dots, X_n^*$ 
uniformly with replacement, designating them as the bootstrap sample. Note that each 
observation $X_i^*$ contains the information of k-shell and influence for a single node. 

Step 2. Compute the average influence of the top f fraction of the bootstrap sample 
$X_1^*, \dots, X_n^*$ ranked by k-shell. Denote this average influence as $M(f)^*$. 
Figure 6 | k_sum has good performance in identifying influential spreaders. The comparisons of k_s with k_sum and k_2sum are shown for LJ (a, b), APS (c, d), 
Facebook (e, f) and Twitter (g, h). Error bars indicate the 95% confidence intervals. To our surprise k_sum has performance comparable with k_s. With more 
local information, k_2sum improves the performance slightly. 
Figure 7 | Effect of sampling methods on k_s, k_in and PageRank. Snowball sampling used for Facebook data does not change the relative ranking for k_s (a), 
k_in (b) and PageRank (c) dramatically. Meanwhile, with the activity sampling adopted in Twitter data, the rankings for k_s (d), k_in (e) and PageRank (f) are 
also not affected significantly. 



Step 3. Obtain m bootstrap estimates $M(f)^*_1, \dots, M(f)^*_m$ by repeating Step 1 and Step 
2 m times. Order the resulting estimates $M(f)^*_{(1)} \le M(f)^*_{(2)} \le \dots \le M(f)^*_{(m)}$. Set 

$$\xi_{\alpha/2} = M(f)^*_{([(m+1)\alpha/2])}, \qquad \xi_{1-\alpha/2} = M(f)^*_{([(m+1)(1-\alpha/2)])}.$$

For large enough m, the process allows direct measurement of the probability of 
obtaining any value of the parameter in question. Furthermore, the distribution of the 
estimates determines the $1 - \alpha$ confidence intervals of M(f) for each k_s index. In this 
paper, we set $\alpha = 0.05$ and $m = 10^5$. By altering the estimated property in Step 2, we 
adapt this technique for assessment of confidence intervals for other measures. 
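
Steps 1 to 3 can be illustrated with a short Python sketch of the percentile bootstrap for M(f). It is a simplified illustration under stated assumptions rather than the code used here: each observation is a `(k_s, M)` pair, ties in k_s are broken by sample order, the bracket [·] in the order-statistic index is read as the integer part, and the default `m = 1000` is chosen only to keep the example fast (the analysis above uses m = 10^5). The function names are hypothetical.

```python
import random

def average_influence_top_f(sample, f):
    """Step 2: mean influence of the top f fraction ranked by k-shell."""
    ranked = sorted(sample, key=lambda x: x[0], reverse=True)
    top = ranked[:max(1, int(f * len(ranked)))]
    return sum(influence for _, influence in top) / len(top)

def bootstrap_ci(data, f, m=1000, alpha=0.05, seed=None):
    """Steps 1-3: percentile bootstrap confidence interval for M(f).

    data: list of (k_s, M) pairs, one per node.
    Returns the pair (lower, upper) of the 1 - alpha confidence interval.
    """
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(m):
        # Step 1: draw n observations uniformly with replacement.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(average_influence_top_f(resample, f))
    # Step 3: order the estimates and read off the two order statistics
    # (1-based indices converted to 0-based list positions).
    estimates.sort()
    lower = estimates[max(0, int((m + 1) * alpha / 2) - 1)]
    upper = estimates[min(m - 1, int((m + 1) * (1 - alpha / 2)) - 1)]
    return lower, upper

# Toy data: influence grows with k_s.
data = [(k, float(k)) for k in range(100)]
low, high = bootstrap_ci(data, f=0.1, seed=42)
```

Replacing `average_influence_top_f` with another statistic, for example the recognition rate, adapts the same routine to the other measures.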